This material is adapted from an R for RNA-Seq workshop first run here.
This course has been designed to introduce biologists to R for RNA-Seq analysis. The focus here is on using tidyverse to analyse RNA-Seq data, as we believe this is a productive and engaging way to learn R for RNA-Seq analysis. In this course we will use some new packages, tidyBulk and tidyHeatmap. These packages provide a friendly tidyverse-style way to perform analysis of RNA-Seq data.
Data files are available from the data folder in GitHub here. You should download the files listed below and place them into a folder called data in your working directory.
Data files:
Packages used:
Here we will use packages from the 3 main repositories of R packages: Bioconductor, CRAN and GitHub. To install the packages you can follow the steps below.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("edgeR")
Install the CRAN packages with the command below.
install.packages(c("tidyverse", "devtools"))
tidyBulk will be added to Bioconductor and when it is added there you can install it using the usual Bioconductor commands. tidyHeatmap will be added to CRAN and when it is there you can install it with the install.packages() command. In the meantime you can install tidyBulk and tidyHeatmap from their development sites in Github with the commands below.
# install tidyBulk
devtools::install_github("stemangiola/tidyBulk@dev")
# install tidyHeatmap
devtools::install_github("stemangiola/tidyHeatmap")
TODO: Maybe add workflow image showing steps and functions used
Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today we will be starting from the count data and getting stuck into analysis.
First, let’s load all the packages we will need to analyse the data.
# load libraries
library(tidyverse)
library(tidyBulk)
library(tidyHeatmap)
In this tutorial, we will learn some R through creating plots to visualise data from an RNA-Seq experiment. RNA-Seq counts file can be obtained from the GREIN platform. GREIN stands for GEO RNA-Seq Experiments Interactive Navigator and provides >6,500 published datasets from GEO that have been uniformly processed. It is available at http://www.ilincs.org/apps/grein/. You can search for a dataset of interest using the GEO code. We obtained the dataset used here using the code GSE60450. GREIN provide QC metrics for the RNA-Seq datasets and both raw and normalised counts. We will use the raw counts here. Generally, the higher the number of counts the more the gene is expressed.
Here we will perform RNA-Seq analysis using data from a breast cancer research study, from the paper by Fu et al. 2015, GEO code GSE60450. This study examined gene expression in basal and luminal cells from mice at different stages of mammary gland development (virgin, pregnant and lactating). There are 2 samples per group and 6 groups, 12 samples in total.
Set up an RStudio project specifying the directory where you have saved the /data directory. Open a new script for this workshop File > New File > R Script. Save it as e.g. intro-rnaseq.R.
# read in counts file
counts <- read_csv("data/GSE60450_GeneLevel_Raw_data.csv")
Missing column names filled in: 'X1' [1]Parsed with column specification:
cols(
X1 = [31mcol_character()[39m,
gene_symbol = [31mcol_character()[39m,
GSM1480291 = [32mcol_double()[39m,
GSM1480292 = [32mcol_double()[39m,
GSM1480293 = [32mcol_double()[39m,
GSM1480294 = [32mcol_double()[39m,
GSM1480295 = [32mcol_double()[39m,
GSM1480296 = [32mcol_double()[39m,
GSM1480297 = [32mcol_double()[39m,
GSM1480298 = [32mcol_double()[39m,
GSM1480299 = [32mcol_double()[39m,
GSM1480300 = [32mcol_double()[39m,
GSM1480301 = [32mcol_double()[39m,
GSM1480302 = [32mcol_double()[39m
)
# read in metadata
sampleinfo <- read_csv("data/GSE60450_filtered_metadata.csv")
Missing column names filled in: 'X1' [1]Parsed with column specification:
cols(
X1 = [31mcol_character()[39m,
characteristics = [31mcol_character()[39m,
immunophenotype = [31mcol_character()[39m,
`developmental stage` = [31mcol_character()[39m
)
Let’s take a look at the data. You can type the name of the object to view the first few lines and to see how many rows and columns it has.
counts
The counts object contains information about genes (one gene per row), the first column has the Ensembl gene id, the second has the gene symbol and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample. Note the gene counts here are not integers as they’re estimated counts from salmon (see here: https://support.bioconductor.org/p/101156/). There are two replicates for each cell type and time point (detailed sample info can be found in file “GSE60450_series_matrix.txt” from the GEO website). The sampleinfo metadata file contains basic information about the samples that we will need for the analysis today.
First we will convert the counts into long format (tidy format), similar to what we did in the Intro to R session.
# convert to tidy format
counts <- pivot_longer(counts, cols = starts_with("GSM"), names_to = "sample", values_to = "count")
# take a look
counts
We will next extract just the columns we need, sample, gene_symbol, count. To do this we will use the tidyverse pipe %>%. This ‘pipes’ the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps (as we’ll see later). It also can make the steps clearer and easier to see. For more details on the pipe see here.
# using pipe
counts <- counts %>%
select(sample, gene_symbol, count, X1)
# take a look
counts
Take a look at the sampleinfo file. The first column “X1” contains the sample ids, the second “characteristics” contains the specific group the sample belongs to (e.g. mammary gland, luminal cells, virgin), the third column “immunophenotype” contains just the cell type (luminal or basal) and the fourth column “developmental stage” contains just the stage (virgin, pregnant or lactating).
sampleinfo
We want to compare the groups in the “characteristics” column however the names are quite long so, similar to what we did in the Intro to R session, we’ll make a column containing shorter group names.
# make column called condition with shorter group names
sampleinfo <- mutate(sampleinfo, condition = case_when( str_detect(characteristics, "basal.*virgin") ~ "bvirg",
str_detect(characteristics, "basal.*preg") ~ "bpreg",
str_detect(characteristics, "basal.*lact") ~ "blact",
str_detect(characteristics, "luminal.*virgin") ~ "lvirg",
str_detect(characteristics, "luminal.*preg") ~ "lpreg",
str_detect(characteristics, "luminal.*lact") ~ "llact"
))
sampleinfo
Now we have our counts matrix in the long format we will join it to our sampleinfo so we have information on the samples, what groups they belong to. This is similar to what we did in the Intro to R session.
counts <- full_join(counts, sampleinfo, by = c("sample" = "X1"))
# take a look
counts
We can shorten the sample names. We can remove the GSM1480 prefix that’s present in all of them as shorter names can fit better in some of the plots we will create. We can use mutate() together with str_replace() to remove the GSM1480 string from the sample column.
counts <- counts %>%
mutate(sample=str_remove(sample, "GSM1480"))
Now that we have our data in the format we want we will create a tidyBulk object, that we can use to perform differential expression analysis with the tidyBulk package. For this we need to specify our counts object and the names of the columns that contain our sample ids, our gene identifiers and our counts. Any other columns in the counts object e.g. our Ensembl gene id “X1” column will remain at the end.
#create a 'tt' object
counts <- tidyBulk(counts, sample, gene_symbol, count)
# take a look
counts
Some gene symbols are not unique, they map to more than one gene id. We need to remove this redundancy and we can do that with tidyBulk function aggregate_duplicates(). By default it will aggregate duplicate gene symbols summing their counts. Add way to identify what the duplicates are?
# get rid of duplicated gene symbols
counts <- aggregate_duplicates(counts)
Before testing for differential expression we need to do some preprocessing of the data. We filter lowly expressed genes and also normalise for differences in sequencing depth and composition between the samples. These steps are explained in more details below.
First we will check how many counts we have for each sample. We can do this quite easily by making a bar plot. This helps us see whether there are any major discrepancies between the samples more easily.
# make barplot of counts
ggplot(counts, aes(x=sample, weight=count, fill=condition)) +
geom_bar()
The bar plots show us there are ~20 million counts per sample.
Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.
There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, we favour filtering on a minimum count threshold in the smallest group size. This ensures that a gene will be retained if it is only expressed in one group. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least two samples (our smallest group size). As a general rule, a good CPM threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5 as there are ~20 million counts per sample (20/10 = 0.5 CPM). If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample. Smaller CPM thresholds are usually appropriate for larger libraries. So for this dataset we will filter lowly expressed genes using a cpm_threshold of 0.5 and a prop_threshold (proportion of samples) of 2/12 with tidyBulk.
TMM normalisation is performed to eliminate composition biases between libraries [@robinson2010tmm]. This generates a set of normalisation factors, where the product of these factors and the library sizes defines the effective library size. TMM normalisation (and most scaling normalisation methods) scale relative to one sample.
In the tidyBulk package the function scale_abundance() performs the filtering and normalisation to generate normalised counts.
# Normalisation for library size and composition bias (scale counts),
counts.norm <- counts %>% scale_abundance()
# take a look
counts.norm